Data input and retrieval apparatus

ABSTRACT

Input apparatus for a data processing system includes a processor, storage and graphical display in which a free-form source document is input and processed to parse a source document to locate semantically meaningful entities and to store corresponding content data. The graphical display is arranged to generate a visual representation of the source document in which the semantically meaningful entities are represented by pictorial elements.

BACKGROUND OF THE INVENTION

This invention relates to method and apparatus for the input of data into computers and, in some embodiments, to subsequent retrieval thereof. Particularly but not exclusively, in one embodiment the invention relates to the input of data to, and data retrieval from, a database, and in another to the input of data defining a specification for a computer program.

1. Field of the Invention

The problem of providing communication between humans and computers has occupied those in the fields of computing hardware and software since the birth of computing. For decades, the goal has been to provide computers which can communicate with a human “naturally” by understanding free-form speech or text input. However, despite continued progress, this goal has not been reached yet.

Human-computer interaction is used for many things. For example, it is used to input immediate instructions for action by a computer (which is at present mainly provided by the combination of a cursor control device such as a mouse and icons displayed on the screen, or by the use of menus). It is also used to input instructions for subsequent execution (which is mainly achieved at present by forcing human beings to use tightly constructed programming languages or descriptive languages which, despite their superficial resemblance to human languages, bear little relationship to the way human beings actually communicate). Finally, it is used for data storage and retrieval (which is at present typically performed by storage of a textual document, and retrieval by searching for the occurrence of character strings within the document).

Those skilled in the art have approached this problem by the development of artificial intelligence techniques, with the aim either of providing a sufficiently comprehensive set of rules that a machine can eventually understand natural language input, or of providing a “self learning” machine capable of developing the same ability by repeated exposure to natural language.

SUMMARY OF THE INVENTION

In one aspect the present invention seeks to address the same technical problem, but from a different direction. In the present invention, an input document (which may be spoken or in text form, or indeed in any other form representative of natural language) is input to an input apparatus (which may be provided by a general purpose computer) and is analysed, to separate the meaningful concepts within the document and record these together with their inter-relationship. The present invention has this in common with most attempted artificial intelligence systems.

For example, EP-A-0118187 discloses a natural language input system which is menu driven, allowing a user to select one word at a time from a menu, which prompts the next possible choices based on what has previously been input.

U.S. Pat. No. 5,677,835 discloses an authoring system in which documents to be translated are input and analysed, and where ambiguities are detected, the user is prompted to resolve them.

In this aspect of the invention, however, these meaningful entities (for example, the concepts described by nouns) are displayed on an output screen, in a graphical form, which represents them as separate icons and meaningfully indicates their interconnection or relationship.

This apparently simple step provides a number of benefits. The first is that it gives immediate feedback to the human inputting the data of the “understanding” gained by the computer. Natural human language is full of ambiguities which human beings are normally able to resolve without conscious thought because of their shared knowledge base, but which are at best ambiguous to, and at worst mis-recognised by, a computer.

To take an English example, “Mary was kissed by the lake” is ambiguous, since it can be interpreted either as indicating that the lake is the active party (the kisser) or that the lake is the location at which Mary is kissed by an (unknown) active party.

Whereas a human immediately understands the correct meaning, and may not even see the presence of an ambiguity, a computer is unable to do so unless programmed by a rule or conditioned by experience.

By displaying graphically the construction understood, however, the present invention makes such ambiguities immediately recognisable to the user, enabling them to be avoided.

Very preferably, the invention provides a graphical user interface to enable the user to manipulate the graphical display, and means to interpret the results of such manipulation. Indeed, it would in principle be possible to allow the user to directly input the document graphically without previous direct document input (although this is not preferred, for reasons of speed, for most applications).

Thus, the user is able to allow the computer to extract as much meaning as possible from the input document and then to correct the ambiguities or errors graphically.

The invention will be understood to differ from so called “visual programming” systems, as described, for example, in EP-A-0473414. Such visual programming systems provide a graphic environment in which operations to be specified are represented visually, and a user may specify a sequence of such operations by editing the display to create and alter linkages between the elements. However, in visual programming, as in other known methods of creating or specifying programs, the user is constrained to select from a limited number of predefined operations and connections therebetween. By contrast, the present invention accepts documents as input and analyses the documents to provide the graphical display which may subsequently be edited.

The resulting semantic structures, corresponding to the graphical representation (corrected where necessary), are stored for subsequent processing or retrieval. In one embodiment, data retrieval apparatus is provided. In another embodiment, the stored data is employed by a code generator, to generate a computer program.

The invention is advantageously used for this latter application, because the detection of ambiguities eliminates one of the difficulties in existing software specification and automatic code generation from such specifications.

In either case, in the preferred embodiment there is a stored lexical table which stores data relating to the meanings of words which will be encountered in the source document (analogously to an entry in a well-structured dictionary).

Preferably, in this case, the apparatus is arranged to perform “reasoning” utilising this semantic information, by comparing the meanings of groups of words (i.e. clauses or sentences) of the document to locate inconsistencies, or by performing the same operation between multiple different documents.

This is particularly advantageous in embodiments where the source document is to act as a specification for the generation of computer code, because it enables the location of conflicting requirements.

Since the present invention, in this embodiment, has some “understanding” of the “meaning” of words, it is able to store the content data (for example in the form of semantic structures representing groups of words such as clauses or sentences) by reference to such “dictionary entries”, i.e. by reference to their “meaning”, rather than the source language word which was input. This makes it possible to use a multilingual embodiment of the present invention, where the lexical entries are mapped onto corresponding words in each of a plurality of languages, so that data may be input in one language and output in one or multiple different languages.

In embodiments for data retrieval, or similar applications, each such lexical entry may have an associated code indicating the “difficulty”, “obscurity” or “unfamiliarity” of the concept described. For example, concepts may be labelled as familiar to children upwards; familiar to adults; or familiar only to particular specialists such as physicists, chemists, biologists, or lawyers.

With knowledge of the level of familiarity of the data retriever, the present invention is in this embodiment able to utilise such ratings to output data appropriate to the understanding of the retriever so as not to output information which is too facile for an advanced user, or too complex for a casual user.

Different semantic elements may be associated, explicitly or implicitly, with an access level rating. Thus, for example, classified items may be available for access only to properly identified users; “adult only” items may be classified as unavailable to identified children; and proper names may under some circumstances be suppressed (for example, the name of parties to litigation).

By associating a classification with each item, rather than with documents or materials as a whole, a much finer-grained control of information is obtained.

In data retrieval embodiments, the data retrieval apparatus preferably comprises a natural language generator, for generating a document from semantic structures produced as described above. This has several advantages over the mere supply of corresponding portions of the original document.

Firstly, as described above, it provides the possibility of multilingual embodiments since different generators may, from the same semantic structure, generate text in different languages.

Secondly, where access codes are employed as described above, the generator may in preferred embodiments be able to re-generate readable text from a reduced amount of information (for example, by using the passive voice where a name is suppressed, rather than the active voice). This aspect of the invention is also useful separately of the data input methods above.

The input and/or output according to various embodiments of the invention may be in the form of speech, text or animated video, in which case the input and/or output apparatus comprises, as appropriate, speech recognisers and/or synthesisers; text input and output; and image pick-up and analysis/video generation apparatus.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects and preferred embodiments are as described in the following description and claims.

Embodiments of the invention will now be illustrated, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of apparatus according to a first embodiment;

FIG. 2 is a block diagram showing in greater detail the processes present in a client terminal forming part of the embodiment of FIG. 1;

FIG. 3 is a block diagram showing in greater detail the processes present in a server forming part of the embodiment of FIG. 1;

FIG. 4 is a block diagram showing in greater detail the processes present in the server of FIG. 3;

FIGS. 5a and 5b are illustrations of output displays produced according to the first embodiment;

FIG. 6a is a flow diagram showing the process of data input according to the first embodiment performed by the terminal of FIG. 1;

FIG. 6b is a flow diagram showing the process of data input performed by the server of FIG. 1;

FIG. 7 is a diagram illustrating the data stored in the terminal of FIG. 1;

FIG. 8 is a flow diagram showing the process of text analysis performed by the server of FIG. 1;

FIG. 9a is a block diagram showing the stored data used for the analysis of FIG. 8;

FIG. 9b shows the storage of data generated by the analysis of FIG. 8;

FIG. 10 is a diagram illustrating the results of the analysis of FIG. 8;

FIGS. 11a and 11b are diagrams showing the display produced by the first embodiment from the data stored in a lexical database forming part of FIG. 9a;

FIGS. 12a and 12b are screen displays showing the graphical representation of stored data at the terminal of the first embodiment, and the corresponding text generated therefrom;

FIG. 13 is a block diagram illustrating the hierarchical arrangement of data stored within the lexical database forming part of FIG. 9a;

FIG. 14 is a diagram showing the contents of a word record within a store forming part of FIG. 9b;

FIG. 15a is a flow diagram showing the process of data retrieval performed by the terminal of FIG. 1;

FIG. 15b is a flow diagram showing the process of data retrieval performed by the server of FIG. 1;

FIG. 16 is a flow diagram showing the stages of text generation performed by the server of FIG. 1 as part of the process of FIG. 15b;

FIG. 17 is a diagram showing the screen display of the results of data retrieval in textual form;

FIGS. 18a and 18b show the corresponding graphical representation thereof;

FIG. 19 corresponds to FIG. 9a, and shows the data stored for a third embodiment of the invention utilising multiple languages;

FIG. 20 is a block diagram of a terminal utilised in a fourth embodiment of the invention and corresponding to that shown in FIG. 1; and

FIG. 21 is a flow diagram showing the process of code generation according to a fifth embodiment of the invention.

FIG. 22 shows a screen display of stored text according to a sixth embodiment;

FIG. 23 is a flow diagram showing schematically the process of inputting text according to the sixth embodiment;

FIG. 24a replicates the display of FIG. 22 with a word highlighted; and

FIG. 24b shows a display of multiple possible meanings for the word;

FIG. 25 is a flow diagram showing schematically the process of retrieving stored text which has been input by the process of FIG. 23; and

FIG. 26 (comprising FIG. 26a and FIG. 26b) shows the transformation of input text (corresponding to a portion of that shown in FIG. 22) to stored records according to the sixth embodiment of the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

First Embodiment

A first embodiment of the invention is an Internet-based information storage and retrieval system.

Referring to FIG. 1, the present invention may be provided by a client terminal 100a connected via a telecommunications network 300 such as the Public Switched Telephone Network (PSTN) to a server computer 200. The terms “client” and “server” in this embodiment are illustrative but not limiting to any particular architecture or functionality.

The client terminal comprises a keyboard 102, a VDU 104, a modem 106, and a computer 108 comprising a processor, mass storage such as a hard disk drive, and working storage, such as RAM. For example, a SUN™ workstation or a Pentium™ based personal computer may be employed as the client terminal 100a.

Referring to FIG. 2, stored within the client terminal (e.g. on the hard disk drive thereof) is an operating control program 110 comprising an operating system 112 (such as Windows™), a browser 114 (such as Windows Explorer™ Version 3) and an application 116 (such as a Java™ applet), designed to operate with the browser 114. The function of the operating system 112 is conventional and will not be described further. The function of the browser 114 is to interact, in known fashion, with hypertext information received from the server 200 via the PSTN 300 and modem 106. The browser 114 thereby downloads the applet 116 at the beginning of the communications session, as part of a hypertext document from the server 200.

The function of the applet 116 is to control the display of received information, and to allow the input of information for uploading to the server 200 by the user, through the browser 114.

In this embodiment, which concerns data input and access, the applet 116 comprises two separate applets 116a, 116b; a first 116a for data input and a second 116b for data retrieval.

Referring to FIG. 3, the server 200 comprises a communications port 202 (e.g. a modem); a central processing unit 204 (e.g. a mainframe computer) and a mass storage device 206 (e.g. a hard disk drive or an array of disk drives).

Referring to FIG. 4, the server 200 comprises an operating program 210 comprising an operating system 212 such as Unix™, a server program 214 and an application program 216. The operating system is conventional and will not be described further.

The function of the server program 214 is to receive requests for hypertext documents from the client terminal 100a and to supply hypertext documents in reply. Specifically, the server program 214 initially downloads a document containing the applet 116 for the client terminal 100a. The server program 214 is also arranged to supply data to and receive data from the application program 216, via, for example, a cgi.bin mechanism or Java Remote Method Invocation (RMI) mechanism. The application program 216 comprises a first application 216a for data input and a second application 216b for data retrieval. Each receives data from a client terminal 100, performs processing, and returns data to that client terminal for display.

Data Input

Overview

Referring to FIGS. 6a and 6b, an overview of the operation of data input will now be given.

On a signal from the user to initiate data input (e.g. by selecting an icon on the terminal 100), in step 302, the browser 114 accesses the server 200 via the PSTN 300 in conventional manner.

In step 322, the server 200 receives the access request, which is passed via the operating system 212 to the server program 214 and thence to the application 216 for handling. The application 216 retrieves a copy of the applet 116 from the store 206, and transmits it via the PSTN 300 to the terminal 100 in step 324.

In step 304, the terminal 100 receives the applet 116 and the browser 114 loads the applet 116 into memory and causes it to execute.

Referring to FIG. 5a, the applet then displays a data entry screen comprising a text entry window 402.

In step 306, the applet 116a then allows text to be typed in to the display area 402 and buffered locally within the terminal 100.

On a signal by the user to commence analysis (e.g. by selecting the button 404 with the input device 102), the applet 116 sends the text which is displayed within the text entry box 402 to the server 200.

In step 326, the server 200 receives the text, which is passed to the application 216. In step 328, the application 216 performs a semantic analysis upon the text, as will be described in greater detail below, to locate and identify within the text the following elements:

a) meaningful semantic entities, typically denoted by nouns. For example, in the example shown, the semantic entities are “satellite”, “number” and “transponders”.

b) the form of each of the entities (e.g. whether it is singular or plural), and whether it is in the definite or indefinite form. In the example shown in FIG. 5a, “the satellite” is singular, and “the” indicates that it is in the definite form. “Number” is singular, and “a” indicates that it is in the indefinite form. “Transponders” is in the plural form, and the lack of determiner after “of” indicates that it is in the indefinite form.

c) “States of affairs”, generally indicated by verbs. States of affairs indicate either actions, as most verbs do, or states of being (e.g. the verb “to be”). In this example, “carry” and “can” are states of affairs.

d) The conditions attached to each state of affairs (e.g. the tense of the verb concerned).

e) Modifiers (e.g. adverbs or adjectives) which ascribe properties or otherwise modify an entity or state of affairs.

f) The linkages between the occurrences of the foregoing (e.g. which entities a state of affairs affects and how; and which entities or states of affairs a modifier modifies).

Each detected state of affairs, entity or modifier is represented by a stored entry (to be described in greater detail) in the store 206.

The application 216 then generates display control data for transmission to the terminal 100, comprising a list or string of records each corresponding to a recognised entity, state of affairs or modifier, and each containing the name of the object (i.e. text to be displayed) which indicates the item recognised (e.g. the word which was typed in, such as “satellite” or “carry”), together with pointer data indicating the semantic connections therebetween, and data indicating the form thereof (e.g. as discussed above, singular or plural etc.).

This data is transmitted in step 330 to the terminal 100 and received thereat at step 310; on reception, it is passed to the applet 116a which stores it. Conveniently, the applet 116a is an object-oriented program, arranged to store class data relating to a state of affairs class; a modifier class; and an entity class, and the data comprises a set of records each of which is interpreted as an object instantiating a respective one of the classes. Associated with each of the classes is drawing code, enabling the applet 116 to cause the corresponding object to be drawn.

FIG. 7 shows the form in which three such objects 510, 520 . . . , 540 are stored; each comprises a field (512, 522) for draw attributes (such as position and size) used to draw the object; a field (514, 524) storing the meaning (e.g. a word representing the noun, verb or modifier the object shows); a field (516, 526) storing the parameters of the semantic entity the object represents (e.g. whether it is singular or plural and so on); and a pair of pointer fields (518, 519; 528, 529) storing, respectively, pointers to one or more objects to which the object is linked, and to one or more objects from which the object is linked. From these pointer fields, the interconnection between objects is derived, and graphically represented as shown in FIG. 5b.
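By way of illustration only, the stored objects of FIG. 7 might be modelled along the following lines. This is a minimal sketch in Python rather than the Java™ applet of the embodiment; the class name `SemanticObject`, the field names, the use of integer identifiers as pointers and the direction of the links are all assumptions made for the example and are not taken from the figures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SemanticObject:
    """One displayable object, loosely mirroring the fields of FIG. 7."""
    obj_id: int
    category: str    # "entity", "state_of_affairs" or "modifier"
    meaning: str     # word shown inside the drawn shape
    params: Dict[str, str] = field(default_factory=dict)  # e.g. number, form
    draw: Dict[str, int] = field(default_factory=dict)    # position and size
    links_to: List[int] = field(default_factory=list)     # ids this object points to
    links_from: List[int] = field(default_factory=list)   # ids pointing at this object


def connect(objects: Dict[int, SemanticObject], src: int, dst: int) -> None:
    """Record a directed link in both pointer fields."""
    objects[src].links_to.append(dst)
    objects[dst].links_from.append(src)


def edges(objects: Dict[int, SemanticObject]) -> List[Tuple[str, str]]:
    """Derive the arrows to be drawn from the outgoing pointer fields."""
    return [(o.meaning, objects[t].meaning)
            for o in objects.values() for t in o.links_to]


objects = {
    1: SemanticObject(1, "entity", "satellite",
                      {"number": "singular", "form": "definite"}),
    2: SemanticObject(2, "state_of_affairs", "carry"),
    3: SemanticObject(3, "entity", "number",
                      {"number": "singular", "form": "indefinite"}),
}
connect(objects, 2, 1)  # assumed direction: state of affairs -> entity
connect(objects, 2, 3)
```

Here `edges(objects)` yields the pairs from which arrowed lines could be drawn, while the `draw` field is kept separate because it affects only the display and not the stored meaning.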

Referring to FIG. 5b, an analysis display area 406 is provided, in which the applet program 116a draws each of the objects received from the server 200, represented by a shape (which is visually different for entities, modifiers and states of affairs, for example by being a different shape and/or a different colour) and, within each, represents the name of that object (which generally corresponds to the text which was input). Additionally, the applet 116a draws graphical linkages (shown as arrowed lines) indicating the semantic connections between the objects, using the pointer data within each object, and a drawing program for drawing an arrowed line to represent the pointer data.

The result is, as shown in FIG. 5b, that the results of the text analysis performed by the application 216 are shown graphically, for inspection by the user, in a form which enables immediate recognition of any misunderstanding by the analysis program.

Upon display of the semantic graph display data within the display window 406, the applet 116a is operable to allow the user to edit the graphical display, specifically by “selecting” linking pointer representations (e.g. using a mouse) and deleting them (e.g. by operating a “delete” key on the keyboard 102); and to add new pointers (e.g. by “selecting” a first displayed object; moving the displayed cursor to a second; and indicating that a link is to be formed by operation of the mouse or keyboard 102).

Thus, where an incorrect relationship between displayed objects has been understood by the application 216, the user may re-draw the graph to represent the correct relationship. The applet 116 is arranged to edit the pointer data (528, 529; 518, 519) held within each locally stored semantic object, to update changed linkages.
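The pointer update performed on such an edit can be sketched as follows. This is an illustrative Python fragment, not the applet code; the function names, the dictionary layout and the modifier “large” are hypothetical. The point shown is that deleting or adding a link must update both the “to” and the “from” pointer fields so that they remain consistent.

```python
def remove_link(objects, src, dst):
    """Delete a link, clearing the pointer at both ends."""
    objects[src]["links_to"].remove(dst)
    objects[dst]["links_from"].remove(src)


def add_link(objects, src, dst):
    """Add a link, recording it at both ends (no duplicates)."""
    if dst not in objects[src]["links_to"]:
        objects[src]["links_to"].append(dst)
        objects[dst]["links_from"].append(src)


# Hypothetical mis-analysis: the modifier was attached to the wrong entity.
objects = {
    "large": {"links_to": ["satellite"], "links_from": []},
    "satellite": {"links_to": [], "links_from": ["large"]},
    "number": {"links_to": [], "links_from": []},
}

# The user deletes the wrong arrow and draws the correct one.
remove_link(objects, "large", "satellite")
add_link(objects, "large", "number")
```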

Provision is also preferably made to enable the deletion of displayed objects where the user wishes to do so.

Additionally, it is preferred that the applet 116a provides conventional graphical user interface options for moving displayed objects and/or re-sizing displayed objects; in this case, the applet 116a alters the values of locally stored display attributes (512; 522) of each object.

On finishing the editing step 314 thus described, the user indicates that the edit is complete (e.g. by operating a key on the keyboard 102) and, in response thereto, the applet 116a sends the edited object data (not including display attribute data (512; 522) discussed above) to the server 200 in step 316.

In step 332, the server 200 receives the edited data, and stores the data in the store 206 in step 334.

Analysis Process

Further details of the operation of the application 216 in performing the analysis step 328 will now be described.

In a step 602 of FIG. 8, the text received from the terminal 100 is pre-processed, to detect the beginnings and ends of words (by the presence of a space); the presence of punctuation; the presence of capital letters (indicating either a proper name or the beginning of a sentence); and the presence of numeric or other special characters. The pre-processed text, in which words, numbers, and so on are separated and flagged, is stored in a text buffer 222 within working memory 220 provided within the store 206 as shown in FIG. 9b.
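A pre-processing pass of this kind might look like the following. This is a minimal sketch in Python using the standard `re` module; the token pattern, the flag names and the sample sentence are assumptions made for the example, and the actual step 602 may separate and flag text differently.

```python
import re

# Words (with an optional apostrophe part), numbers, or any other single
# non-space character, which is treated here as punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\sA-Za-z\d]")


def preprocess(text):
    """Split text into flagged tokens, as a stand-in for the text buffer."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok[0].isalpha():
            kind = "word"
        elif tok[0].isdigit():
            kind = "number"
        else:
            kind = "punctuation"
        tokens.append({
            "text": tok,
            "kind": kind,
            # A capital may mark a proper name or the start of a sentence.
            "capitalised": tok[0].isupper(),
        })
    return tokens


buffer = preprocess("The satellite can carry 24 transponders.")
```

Each token keeps its original spelling, so that later stages can still distinguish a capitalised word from its lower-case form.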

If one or more entries are found, corresponding to different meanings, entries for each meaning are stored (as alternatives), with a pointer to the relevant lexical table entry.

Next, in step 604, an expansions database 232 provided within fixed memory (e.g. a disc drive) 230 within the store 206, is used to map the expanded form of the word onto its root form. For example, plural, masculine or feminine forms of a word are detected, and replaced by the root form of the word and a flag indicating the relevant expansion (e.g. the sex, tense or other form). The result is stored in the text buffer 222.
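The mapping of step 604 can be pictured as a table lookup. The sketch below is in Python; the table contents, the flag names and the function `to_root` are hypothetical stand-ins for the expansions database 232, and a word absent from the table is simply assumed to be in its root form already.

```python
# Hypothetical contents: expanded form -> (root form, expansion flags).
EXPANSIONS = {
    "transponders": ("transponder", {"number": "plural"}),
    "carried": ("carry", {"tense": "past"}),
    "actress": ("actor", {"gender": "feminine"}),
}


def to_root(word):
    """Replace an expanded form by its root form plus expansion flags."""
    key = word.lower()
    root, flags = EXPANSIONS.get(key, (key, {}))
    return {"root": root, "flags": dict(flags)}
```

For example, `to_root("Transponders")` returns the root “transponder” together with a flag recording that the plural form was used.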

Next, in step 606, each word is looked up in a lexical database 234 (to be described in greater detail). If no entry is found in the lexical database for a word, then in step 608, a query as to the meaning of the word is stored for later use.

Next, in step 610, a set of stored grammar rules held in a grammar rules database 236 is accessed, and in accordance with the grammar rules, the words held in the text buffer are parsed. Our earlier application number PCT 97186887.6, filed on Aug. 8, 1997 (and corresponding PCT application PCT/GB98/02389 filed on Aug. 7, 1998), discloses details of one method of parsing which may be employed herein and is incorporated herein by reference in its entirety. Further information on suitable parsing techniques which may be used will be found in James Allen, “Natural Language Understanding”, Second Edition, Benjamin Cummings Publications Inc, 1995.

Briefly, each of the separately stored elements within the text buffer (i.e. each corresponding to the root form of a word, or a number, or other semantic entity) is processed, to apply the grammar rules within the grammar rules database to the word.

The rules specify the manner in which words can be connected grammatically in the language concerned, and thus, for example, where the word is a noun, a rule may specify that it should be preceded by a definite or indefinite article.

Accordingly, for each word, the other words in the text buffer are reviewed to determine whether they can be combined with that word, according to the grammar rules, to produce a grammatically correct structure. For each possible substructure thus produced, the grammatical rules are then applied again, to combine the substructure with other words or substructures. In this way, as shown in FIG. 10, a chart structure in which the original words are related by syntactic connections is assembled.

Some possible substructures thus temporarily created will be rejected, because it is not possible to combine them with the remaining words and substructures to produce an entire grammatically correct sentence or phrase (represented by a single path through the nodes of the chart shown in FIG. 10).

In step 612, it is determined whether a single such parse has been generated. If no parse is possible, a query is generated at step 608, as described below.

If a single successful parse is extracted (corresponding to a single, unambiguous, meaning of the input text) then in step 614 a single set of output data is generated as a message for transmission to the terminal 100.

As already described, the data comprises a list of the word meanings detected, together with the parameter data indicating the form of the words and the form of the context in which they occur (i.e. whether the word is present as a single or plural item, definite or indefinite case), as determined by the application of the grammar rules, together with pointer data linking words which are modifiers, words which are states of affairs and words which are entities.

Where (step 612) more than one successful parse was completed, corresponding to two different meanings of the input text (caused either by an ambiguity in the sense of one of the words, such as the ambiguity in English between the word “bank” meaning the side of a river and the word “bank” meaning a financial institution; or by an ambiguity in the semantic relationship between the words, such as, in English, the above quoted example of “Mary was kissed by the lake”), then multiple separate sets of data are created, one for each of the parses, for transmission to the terminal 100.

Where such multiple sets of data are supplied, the applet 116a is arranged to display all such alternative constructions of the input document; for example, by displaying them sequentially as the user toggles through the meanings by selecting a key on the keyboard 102 or a display button on the screen (visible in FIG. 5b).

Query Handling

Where no parse is possible, either because one or more words are unrecognised (step 606) or because the words, whilst recognised, do not sufficiently obey the grammatical rules to permit parsing (step 612), a query is generated in step 608 as a message for transmission to the terminal 100.

In the case of an unrecognised word, the message may be such as to cause the applet 116a to display text such as:

“The following word was unrecognised. Please either re-enter the word if it was mis-spelt, or supply the following data:”

The message is presented as an editable form, in which the user of the terminal 100 can either enter a corrected version of the word for transmission, or substitute an alternative word, or supply sufficient information to create a new entry in the lexical database 234 and the expansion form database 232 for the word.

Such data will vary from language to language, but will include the gender (where relevant to the language), expansion form data for storage in the database 232, category (i.e. modifier, state of affairs or entity), and the meaning data discussed below, providing information on the meaning of the word.

As a preliminary step, the word may be checked using a (conventional) spell check operation, and where multiple close matches are found, each may be tested to determine whether it leads to a possible parse; where one of the possibilities uniquely leads to a possible parse, this may be selected, or alternatives may be transmitted back to the terminal 100 for selection of an appropriate one by the user.

Upon completion of the response by the user, either the amended word or the new definition is transmitted back to the host computer 200, and (after updating the expansion form database 232 and lexical database 234 if necessary) the process of FIG. 8 is repeated.

In the event that it was not possible to parse the input text, the query message may be in the form “the text input was . . . It was not possible to understand the meaning of this text. Please review the text, and modify it to clarify its meaning”.

The original text may be presented by the applet 116a in an editable window, allowing the user to edit it either to correct any mistakes or to substitute alternative words for those used.

Lexical Database

The lexical database 234 includes an entry for each meaning of each word in the language (or all languages) used by the system. Each entry for each meaning includes a definition of the meaning of the word in the or each language. The entries are hierarchically ordered as shown in FIG. 13. The uppermost layer of the hierarchy consists of entries for the three categories of entities, states of affairs and modifiers. Each category is then further subdivided.

The applet 116a is arranged to be operable (e.g. by selection of a key on the keyboard 102, or an area of the display) to allow the input of a word by the user via the keyboard 102 for transmission to the computer 200, and the applet 116a is arranged to receive in reply a document (for example in text or hypertext) from the computer 200 and to display it in a dictionary display area 410 shown in FIGS. 11a and 11b.

FIG. 11a illustrates the definition data, retrieved from the lexical database 234 and displayed in the dictionary display area 410, for the word “satellite”. Entries in the database for the same word include data indicating the relative frequency of occurrence of the definition concerned, and the application 216 is arranged to format the different meanings into order by their frequency of occurrence.

As shown in FIG. 11a, the most commonly occurring meaning of satellite is an artificial satellite (a man-made object that orbits around the earth), which is an entity (indicated by “n”); the second commonest is a person who follows another (also an entity); the third most commonly occurring meaning is a celestial body orbiting around a planet or a star (also an entity); and the fourth is a modifier, indicating that something is surrounding and dominated by a central authority or power.

The present embodiment uses the WordNet™ lexical database, available from Princeton University, Princeton, N.J., USA, details of which are at http://www.cogsci.princeton.edu/~wn/. Other known databases, modified to have the structure herein where necessary, could be used.

Each meaning is displayed together with the meanings of hierarchically higher (i.e. broader) words. For example, taking the first meaning of satellite, a satellite is more broadly defined as equipment, and still more broadly defined as instrumentality or instrumentation, and still more broadly as an artefact (a man-made object) and still more broadly as an object (a non-living entity) and ultimately, as an entity.

The hierarchical storage within the lexical database 234 is achieved, for example, by providing each definition entry with a pointer field pointing to the entry of the immediately broader category into which it falls, and so on, and a pointer field pointing to the immediately narrower entries falling within it, as shown in FIG. 13.
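Purely as an illustrative sketch, the broader and narrower pointer fields might be modelled as below. The class name `MeaningEntry` and its field names are assumptions made for this sketch and are not taken from FIG. 13; the example hierarchy follows the first meaning of “satellite” given above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False, repr=False)
class MeaningEntry:
    word: str
    category: str                               # "entity", "state of affairs" or "modifier"
    broader: Optional["MeaningEntry"] = None    # pointer to the immediately broader entry
    narrower: List["MeaningEntry"] = field(default_factory=list)  # pointers to narrower entries

    def add_narrower(self, child: "MeaningEntry") -> "MeaningEntry":
        # Maintain both pointer fields together so the hierarchy stays consistent.
        child.broader = self
        self.narrower.append(child)
        return child

def broader_chain(entry: MeaningEntry) -> List[str]:
    """Follow the broader pointers up to the top of the hierarchy."""
    chain = []
    node = entry.broader
    while node is not None:
        chain.append(node.word)
        node = node.broader
    return chain

def all_narrower(entry: MeaningEntry) -> List[MeaningEntry]:
    """Collect every entry lying below the given entry."""
    found, stack = [], list(entry.narrower)
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(node.narrower)
    return found

entity = MeaningEntry("entity", "entity")
obj = entity.add_narrower(MeaningEntry("object", "entity"))
artefact = obj.add_narrower(MeaningEntry("artefact", "entity"))
instrumentality = artefact.add_narrower(MeaningEntry("instrumentality", "entity"))
equipment = instrumentality.add_narrower(MeaningEntry("equipment", "entity"))
satellite = equipment.add_narrower(MeaningEntry("satellite", "entity"))
```

Calling `broader_chain(satellite)` walks from “equipment” up to “entity”, which is the sequence of broader meanings displayed alongside the definition.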

From the dictionary display area 410, the user may select one of the displayed meanings (e.g. by the use of a mouse) and, in response, the applet 116 is arranged to create a new object and display a corresponding image representation thereof within a drawing frame shown in FIGS. 12a and 12b, functioning in the same manner as the display area 406, described above.

Likewise, in the same way, having selected several such objects, the user may link them as described above by pointers and, having created a new displayed structure or edited an existing one, can cause the applet 116a to upload the results to the computer 200 as described above in relation to step 330.

It is thus possible, in this manner, for a user either to edit the data stored in relation to a document which has already been input from text as described above, or to create a new document directly without text input.

Alternatively, when the applet 116a is displaying a draw window as shown in FIGS. 12a and 12b, the user may select one of the objects (e.g. using a mouse) and cause the applet 116a thereby to send a message to the computer 200 to request the return of a document containing the data stored in the lexical database for the meaning associated with that object, which the applet 116 is arranged then to display. A user may thereby determine the meaning stored in relation to a particular part of a document which has been input, and determine whether it corresponds with the meaning which he intended.

Each entry in the lexical table, in this embodiment, also includes a code indicating (for example on a scale of 1 to 5) the difficulty, complexity or obscurity of the meaning concerned. For example, a code “1” may denote words whose meanings will be familiar to (and not objectionable to) children below the age of 16; a code “2” may denote a word the meaning of which will be known to most adults; a code “3” may indicate a word not in common use, and a code “4” may indicate a word used only by a technical specialist in some field (such as law or biology).

Data Storage

Each statement input by the user (typically corresponding to a sentence) is stored in the data store 240, in step 334. Where a single communications session between the terminal 100 and the computer 200 involves the input of multiple such statements, as will often be the case, a session or document record is created within the store 240 which includes separate entries for each of the statements.

Each such entry consists of a list of records (conveniently implemented as stored objects). Each object of such a statement record comprises an instance of one of the entity, state of affairs or modifier classes, storing a pointer 658 to the relevant meaning record within the lexical database 234, parameter data 660, and pointers 662, 664 to and from the other objects of the statement record, as described in relation to FIG. 7.

At the same time as creating a pointer from the object to the entry in the lexical database, a pointer from the entry in the lexical database 234 to the object is also created.

Each object also comprises a field storing time-stamp information, specifying the date on which the session took place (or on which the information concerned was most recently modified). The entry in this field is supplied from the date/time function of the real time clock of the computer 200 when the data is stored.

Further, each object includes a field 652 indicating the name of the author or, at any rate, name data supplied by the user of the terminal 100 in inputting the text. Finally, each object includes an access rights field 656 including a code specifying to which classes of person the information is to be made available.

For instance, the access code may include bits specifying that the relevant information is only to be supplied to users with an appropriate password; bits indicating that the information is to be supplied only to a person whose name corresponds to the author of the information; bits specifying that the information is not to be supplied to a person under a given age (such as 16); and bits specifying that it is a personal name. The structure of the data within one such object is shown within FIG. 14.
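One way in which such an access code might be represented, given here only as a sketch, is as a set of bit flags. The flag names, their bit positions and the function `may_view` are all assumptions introduced for illustration; the actual layout of field 656 is not specified here.

```python
from enum import IntFlag

class Access(IntFlag):
    # Bit positions are arbitrary choices made for this sketch.
    NONE = 0
    PASSWORD_REQUIRED = 1   # supply only to users with an appropriate password
    AUTHOR_ONLY = 2         # supply only to the author of the information
    MIN_AGE_16 = 4          # do not supply to a person under 16
    PERSONAL_NAME = 8       # marks the object as a personal name

def may_view(code, *, has_password, user_name, author_name, user_age):
    """Decide whether an object carrying `code` may be supplied to this user."""
    if code & Access.PASSWORD_REQUIRED and not has_password:
        return False
    if code & Access.AUTHOR_ONLY and user_name != author_name:
        return False
    if code & Access.MIN_AGE_16 and user_age < 16:
        return False
    # PERSONAL_NAME does not itself block access; it is available to the
    # text generation stage, which may suppress or generalise the name.
    return True

code = Access.PASSWORD_REQUIRED | Access.MIN_AGE_16
allowed = may_view(code, has_password=True, user_name="A", author_name="B", user_age=30)
refused = may_view(code, has_password=True, user_name="A", author_name="B", user_age=12)
```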

In greater detail, the manner in which this data is entered into the additional fields 652-656 is as follows. On the commencement of a communication session at the terminal 100, the applet 116a displays an editable form, into which the user may add access information or an author name, to be applied to all text input in the session.

Additionally, during the session, the user may select any displayed word object, and is presented with a form for inputting access codes to be associated with that particular object or group of objects. Finally, on each occasion where a new word is input, amongst the other data to be input is an access code.

Further, at the host computer 200, on storage of the data subsequent to parsing, a time stamp recording the current date is added to the time stamp field 654 of each object record created, and the author name supplied from the terminal 100 (or in its absence, some identification of the terminal 100 itself) is inserted in the author field 652.

Finally, some entries within the lexical database (corresponding, for example, to classified military items) may have an associated access code, or part of an associated access code, indicating to whom they are to be made available, and this is copied into the access code field 656 of the created object which instantiates the entry in the lexical database.

Data Retrieval

The data retrieval process according to this embodiment comprises two phases: searching and output text generation. The text generation stage may also be used independently of data retrieval.

Referring to FIG. 15, FIG. 15a shows the data retrieval process performed by the terminal 100 under the control of the applet 116, and FIG. 15b shows the data retrieval process performed by the computer 200 under control of the application 216.

In a step 702, the applet 116a displays a search form on the screen.

The search form includes a field for entering the searcher's name; a field for entering access rights information (such as a password or authorisation code); a field or fields for the entry of a date range; a field for entry of an obscurity level (for example, one of the levels referred to above); a field for entering an output format; and fields for the entry of search terms.

The display comprises an editable form within which the user enters, on respective parts of the screen, the information for each field. On completion of the form by the user (step 704) using the keyboard 102, the applet 116b transmits (step 706) the search data to the computer 200.

In step 722 (FIG. 15b) the computer 200 receives the search data. In step 724, the application 216 scans the search terms which have been input and determines whether the terms are present in the lexical database 234, and accesses the record in the database 234 for each term.

If one or more search terms is not present in the database, then (in the same way as described above in relation to data entry codes) the user is prompted to enter a definition or to correct the term.

From each relevant entry in the lexical database 234, the list of objects to which that term points is used to locate each stored object corresponding to an occurrence or instance of that term within a stored statement (step 726).

Where multiple search terms have been input by the user, in step 704, the applet 116 is arranged to permit the user to specify the relationships between the multiple input terms. For example, the user may be searching for stories in which a dog bites a man (or vice versa); in this case the terms “dog”, “bite” and “man” are input as search terms, and in relation to the state of affairs “bite”, the user is prompted to specify the active and passive entities associated with the state of affairs (in other words, to specify that the dog bites and the man is bitten). This is conveniently achieved by creating the statement which is to be searched for in the same manner as described above in relation to data input.

As the entries in the lexical database are hierarchically ordered, the entry for “dog”, for example, may refer to hierarchically lower entries (for example, for “alsatian”, “collie”, and so on). In this case, the application 216 also locates all hierarchically lower entries in the database 234, and in step 726 locates all objects which instantiate those entries.

Having located all objects which relate to the search terms, in step 728, the application 216 locates those statements which include all of those (for example, in this case, all statements which include a “dog” object (or an object relating to any hierarchically lower term); a “bite” object (or any hierarchically lower term) and a “man” object (or any hierarchically lower term)). It then determines whether those objects are in the relationship specified by the user, so as to locate only those statements where the dog is the biter of the man, and not those where the man is the biter of the dog, or the terms are in some other conjunction.
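The search of steps 726 and 728 might be sketched as follows, on the assumption of a much simplified data model. The classes `Entry` and `Obj`, the helper `make` and the use of `active` and `passive` attributes to record the relationship are hypothetical simplifications of the stored records and pointers described above.

```python
from dataclasses import dataclass, field

@dataclass(eq=False, repr=False)
class Entry:
    word: str
    narrower: list = field(default_factory=list)   # hierarchically lower entries
    instances: list = field(default_factory=list)  # back-pointers to stored objects

@dataclass(eq=False, repr=False)
class Obj:
    entry: Entry
    statement_id: int
    active: "Obj | None" = None    # for a state of affairs: the acting entity
    passive: "Obj | None" = None   # for a state of affairs: the entity acted upon

def make(entry, statement_id, active=None, passive=None):
    obj = Obj(entry, statement_id, active, passive)
    entry.instances.append(obj)    # pointer from the lexical entry back to the object
    return obj

def instances_with_narrower(entry):
    """Step 726: objects instantiating the entry or any hierarchically lower entry."""
    found = list(entry.instances)
    for child in entry.narrower:
        found.extend(instances_with_narrower(child))
    return found

def search(state, active_entry, passive_entry):
    """Step 728: statements in which the terms stand in the specified relationship."""
    actives = set(instances_with_narrower(active_entry))
    passives = set(instances_with_narrower(passive_entry))
    return sorted(s.statement_id for s in instances_with_narrower(state)
                  if s.active in actives and s.passive in passives)

alsatian = Entry("alsatian")
dog = Entry("dog", narrower=[alsatian])
man, bite = Entry("man"), Entry("bite")

# Statement 1: "The alsatian bit the man."
make(bite, 1, active=make(alsatian, 1), passive=make(man, 1))
# Statement 2: "The man bit the dog."
make(bite, 2, active=make(man, 2), passive=make(dog, 2))

dog_bites_man = search(bite, dog, man)   # [1]
man_bites_dog = search(bite, man, dog)   # [2]
```

Statement 1 is located by a search for “dog” because “alsatian” is a hierarchically lower entry, while statement 2 is excluded because the entities stand in the opposite relationship.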

If no such statements are located, then in step 730 the application 216 sends a message indicating that the search was unsuccessful to the terminal 100, as will be described below in greater detail.

If one or more statements meets the search criteria then in step 732 the application 216 compiles a list of all such statements and in step 734 the application 216 generates, from the stored object data, text corresponding to each statement, which is sent in step 736 to the user terminal 100, as a display document (for example a hypertext document).

In step 708, the document is received at the user terminal 100 and in step 710 the applet 116b displays the document received, including the generated text of the relevant statements.

In step 711, as will be described in greater detail below, the applet 116 permits the user to select one or more terms from the displayed document (e.g. by selecting a hypertext link within the displayed document) and on such selection, the applet 116b indicates back to the computer 200 the selected hyperlink in step 712.

On receipt of such a selection, in step 738, the application 216 reverts to step 722 to repeat the search, in the manner described above.

It will therefore be seen that according to this embodiment, the user is able to retrieve parts of documents which include predetermined entities in predetermined relationships, rather than searching for all occurrences of words in conjunction (as is the case with current keyword or full-text based database retrieval techniques).

Further, because a lexical database is employed in which multiple meanings of given words are recorded, on generating a search, a user is able to select the correct and unambiguous meaning of a term with two meanings, by utilising, for example, the above data input method to define the search criteria.

Further aspects of the data retrieval process will now be described.

Generation of Text

The process of generation of text from a semantic representation is generally known, for example from our above-referenced earlier patent application or the above reference by Allen. It generally consists of the reverse process to parsing and analysis, but without the ambiguity of analysis.

Thus, referring to FIG. 16, on generation of text the application 216 is arranged to apply the grammar rules stored in the database to the selected objects (step 752), and thereby to build up a stream of text. The word corresponding to each object is then inserted into the position of that object in the stream (step 754), and the correct expanded form of each word is inserted by referring to the expansion database 232 (step 756). Subsequently some text post-processing is performed (step 758), to insert any conventional contractions (such as “I've” for “I have” in English) and properly handle proper names and numeric and date forms.

In the present embodiment, the applet 116b is arranged to be capable of requesting the generation of text from a displayed graphical representation at any point during data input or retrieval, by signalling a list of objects from which text is to be generated to the computer 200 which generates the text and returns it as a document for display by the applet. FIG. 12b illustrates the text thus generated from the data shown in FIG. 12a.

Further Details of Data Retrieval

The other data entered by the user in this embodiment is also advantageously used in assisting data retrieval.

For example, date information may directly be entered to specify retrieval of only statements input between specified dates, and author information may be used to locate only information originating from certain authors.

Complexity information may be used to filter the retrieved information. For example, where a particular retrieved statement is determined by the application 216 to include an object corresponding to an entry in the lexical database 234 with a high level of complexity or obscurity, that object may be omitted from those from which text is generated, as described below, or may be substituted by a hierarchically higher, and hence more general, term (where this has a lower obscurity rating in the lexical database 234).

The applet 116b preferably generates the search form to include two selectable areas, for indication of “more complexity” and “less complexity”, enabling the user, in response to an output document, to indicate whether future output should use material of a higher or lower obscurity level.

Finally, the access or security information may be used to exclude from the generated text those objects corresponding to semantic items for which the access information obtained from the terminal 100 indicates that the user should not have access.

Generation from Partial Information

It will be apparent from the foregoing that a sentence of the type “Fred Smith says that the dog bit the man” is represented by three entities (“Fred Smith”, “dog”, and “man”); and two states of affairs (“say” and “bite”). One feature of the present embodiment is that on deletion of one object, for example “Fred Smith”, the application 216 can still generate text from the remainder.

Referring to FIGS. 12a and 12b, FIG. 12a shows the graphical representation of semantic objects, from which, on selection by the user, the application 216 is arranged to generate text saying “Jack Juraco says Hughes has received orders and requests from all regions of the world”, as shown in FIG. 12b.

If the “Jack Juraco” object is removed from the list of objects from which text is to be generated, the application 216 can either:

Replace the reference by a hierarchically higher term (to generate “A man says Hughes . . . ”)

Omit the object but leave in place the state of affairs “say”. In this case the application will generate in the passive voice “It is said that Hughes . . . ”

Omit the state of affairs “say” also. In this case, the application 216 will generate “Hughes has received . . . ”.
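The three alternatives above may be illustrated by the following toy sketch. It does not reproduce the grammar-rule based generation of FIG. 16; the function `generate`, its string templates and the policy names are assumptions used only to show how the choice of policy affects the output text.

```python
def generate(speaker, broader_term, content, policy):
    """Render a reported statement under one of four suppression policies.

    `speaker` is the entity attached to the "say" state of affairs,
    `broader_term` a hierarchically higher term for that entity, and
    `content` the text generated from the remaining objects.
    """
    if policy == "keep":
        return f"{speaker} says {content}."
    if policy == "generalise":          # replace by a hierarchically higher term
        return f"{broader_term} says {content}."
    if policy == "passive":             # omit the object, keep "say"
        return f"It is said that {content}."
    if policy == "omit":                # omit the "say" state of affairs also
        return content[0].upper() + content[1:] + "."
    raise ValueError(f"unknown policy: {policy}")

content = "Hughes has received orders and requests from all regions of the world"
outputs = {p: generate("Jack Juraco", "A man", content, p)
           for p in ("keep", "generalise", "passive", "omit")}
```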

It will often be desirable to suppress personal names from many classes of material, such as Court reports where witnesses cannot be named, or unattributable statements from officials. This aspect of the invention allows this type of automatic suppression without loss of the information, in response for example to an access code of the user of the terminal 100.

In the present embodiment, where statements are input by an author, on storage of the corresponding data, the application 216 is arranged to create an additional entity object representing the author, and a “say” state of affairs object, so as to attribute all statements input by an author to that author. Such automatically created objects are accorded an access code, which inhibits their retrieval by casual users, who will usually be more concerned with the content of the statements than with their provenance.

Likewise, if for example a particular weapon designation is recorded in the lexical database as being confidential or classified, or the particular instance of that designation recorded by an object is so classified, the application 216 may substitute a hierarchically higher term (e.g. “missile” or “weapon”) or may generate text by omitting the term altogether and substituting the passive voice (where possible).

Thus, this aspect of the invention allows retrieval of as much information from a document as is not classified or otherwise controlled, and permits it to be presented in a comprehensible fashion.

Hypertext Output Format

It will be understood that this embodiment permits, and preferably provides, many different output formats for the document generated by the application 216 to represent the retrieved data.

One preferred format is illustrated in FIG. 17, representing the information in FIGS. 18a and 18b, in response to a search term “HS_601” (a type of satellite manufactured by Hughes).

In this representation, each retrieved statement including the term “HS_601” is represented as generated text. Additionally, each other entity present in the statement for which further statements are available is individually represented in text below the statement, and below each such representation are reproduced any statements including that entity which are present in the same stored document within the computer 200.

Thus, below the statement “The HS 601 is a satellite”, the term “satellite” is represented, and the other three statements concerning satellites in the document are generated and appended to the document. Below the first such statement, which includes the term “myriad”, this term is represented. No further statements including this term are present in the document.

Below the second statement, which includes the term “module”, this term is represented, and a further statement in the same document which includes this term is generated.

Any of the represented terms can be selected by the user from the displayed document generated by the application 216 and displayed at the terminal by the applet 116, by using the keyboard or mouse 106.

On such a selection, in step 712, the applet signals the selected term as a new search term to the computer 200 and the application 216, on receipt at step 738, repeats the search at step 722 and returns a new output document including:

Definition data, where present, from the lexical database 234, and;

Any statements stored in the computer 200 including the selected term, as discussed above, and represented in the same manner as shown in FIG. 16.

Thus, a user can move from an original search topic to related topicsand retrieve data held on those related topics.

Watching Search

The search criteria specified by the user may include a specification that the search is to be updated when new information meeting the search criteria is obtained.

The application 216 is arranged to execute such a specification in one of two ways:

By periodically re-executing the search, but with an additional time constraint that only statements having timestamps within the update period (and therefore after the original search) are located, or;

By setting a trigger to report any new items containing the search criteria to the user.
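The first of these two ways, periodic re-execution with an added time constraint, might be sketched as below. The class `WatchingSearch`, the dictionary layout of the statements and the substring test standing in for the semantic search are all assumptions made for the purposes of illustration.

```python
from datetime import datetime

class WatchingSearch:
    """Re-runs a stored search, returning only statements newer than the last run."""

    def __init__(self, predicate, started):
        self.predicate = predicate   # stands in for the stored search criteria
        self.last_run = started      # time of the original search

    def poll(self, statements, now):
        # Additional time constraint: only time stamps within the update period.
        fresh = [s for s in statements
                 if self.last_run < s["timestamp"] <= now and self.predicate(s)]
        self.last_run = now
        return fresh

statements = [
    {"text": "The HS 601 is a satellite", "timestamp": datetime(2000, 1, 1)},
    {"text": "The satellite has a module", "timestamp": datetime(2000, 1, 10)},
    {"text": "The dog bit the man", "timestamp": datetime(2000, 1, 12)},
]

watch = WatchingSearch(lambda s: "satellite" in s["text"], started=datetime(2000, 1, 5))
first = watch.poll(statements, now=datetime(2000, 1, 15))
second = watch.poll(statements, now=datetime(2000, 1, 20))
```

The first poll reports only the statement stored after the original search; the second poll reports nothing, since no further matching statement has been stored in the interval.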

User Profiles

Some of the search data received from the terminal 100 may be stored in a user record relating to the user, which is used to modify future searches. In particular, the level of complexity or obscurity of material requested by the user may be used to set the level of complexity of items retrieved in future searches.

Second Embodiment

The second embodiment adds to the first the option of references to documents in media other than text. Thus, on data input, the applet is arranged to allow the user to specify, for a graphical display element, statement, or complete document, a linked image file, sound file, video file or other related document.

The first embodiment is modified in that, after supply from the computer 200 of the object data for display at the terminal 100, one of the forms of editing allowed is to specify a link to a file containing the linked document, by inputting the file address (e.g. from a browse screen display).

The applet 116, in response, is arranged to create a visually distinctive graphical element representing the linked document, and identifying its media type.

On signalling the edit results to the computer 200, either the file itself is uplinked to the computer 200 and stored thereat by the application 216, or (where the document is available at another server computer) a reference to the address of the document (e.g. its URL) is transmitted and stored in the lexical database 234. The lexical database may also include stock multimedia material, in the manner of existing multimedia encyclopaedias such as Encarta™ available from Microsoft Inc.

Further, where other data is available to the computer (e.g. in the form of a relational database holding information on one or more of the lexical entries in the lexical database 234) the application 216 also stores a reference to such other data.

Finally, the application is arranged to store the originally-input text of the document, as it is typed in by the user and uploaded from the terminal 100, as an archived document, and the output document includes a link to the archived document, allowing it to be retrieved in its entirety where required for direct quotation, for example.

On data retrieval, the document output to the user is a hypertext document, with links to cause the inclusion of such other material in various media as is available.

Third Embodiment

In the above-described embodiments, the description assumes that the generated text is in the same language as the originally input text.

However, the analysis of the input text, into semantically meaningful entity, state of affairs and modifier elements, results in a representation of the input statements which is substantially language-independent. The present embodiment utilises this to generate text in languages other than that in which the document was input. Our above-referenced earlier patent application discloses aspects of suitable parsing and generation methods which may be used in this embodiment.

Briefly, referring to FIG. 19, the computer 200 stores a plurality of grammar rules databases 236a, 236b, . . . and corresponding expansion databases 232a, 232b, . . . Each pair of databases 232, 236 relates to a given language. On text input, the user specifies the language of the text (the source language), and the application 216 accesses the appropriate expansion and grammar rules databases to analyse the text.

The lexical database 234 in this embodiment comprises a plurality of language-specific lexicons 235a, 235b, . . . , each containing a word list for the language concerned, with each word of the list including a pointer to one or more entries in the lexical database 234, which stores entries comprising meaning data for meanings of each word, and a pointer back to each word for which the entry is a meaning.

As many words in different languages are directly translatable (in the sense of sharing a common meaning), many meaning entries in the lexical database 234 store pointers to words in each language. Not all words are directly translatable, and where meanings differ, the lexical database 234 includes additional, language-specific definitions with pointers from words in only those languages in which they occur.

On entering a search profile, the user specifies the language of the output text (the target language) and the application 216 accesses the lexical database, selects the words from the target language lexicon which are pointed to by the lexical database entries, applies the relevant grammar rules from the target language grammar rules database 236 to generate output text, and expands the word forms using the target language expansion database 232.

Words which are not directly translatable may be substituted by a hierarchically higher, directly translatable meaning (e.g. “dog” for “alsatian”), and/or passed on untranslated. Alternatively, rules for abstraction or language-to-language translation may be used as disclosed in our above-referenced earlier application.
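The word selection just described, including the substitution of a hierarchically higher meaning, might be sketched as follows. The class `Meaning`, its `words` mapping from language code to word, and the function `render` are assumptions made for this sketch; grammar rules and word expansion are not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Meaning:
    ident: str
    words: Dict[str, str] = field(default_factory=dict)  # language code -> word
    broader: Optional["Meaning"] = None                  # hierarchically higher meaning

def render(meaning: Meaning, source: str, target: str) -> str:
    """Select a target-language word for a meaning.

    Falls back to the nearest hierarchically higher meaning which has a
    word in the target language and, failing that, passes the source
    word on untranslated.
    """
    node = meaning
    while node is not None:
        if target in node.words:
            return node.words[target]
        node = node.broader
    return meaning.words[source]

dog = Meaning("dog", {"en": "dog", "fr": "chien"})
alsatian = Meaning("alsatian", {"en": "alsatian"}, broader=dog)

direct = render(dog, "en", "fr")           # "chien"
substituted = render(alsatian, "en", "fr") # "chien", via the broader meaning
passed_on = render(alsatian, "en", "de")   # "alsatian", untranslated
```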

Fourth Embodiment

In earlier embodiments, the input document is typed into the terminal 100 as text via the keyboard 102. In the present embodiment, referring to FIG. 20, the terminal 100 is provided with a microphone 103, and the input text is dictated and transliterated by a speech-to-text conversion program 115, such as ViaVoice™ available from IBM Inc.

The input speech is reproduced as text in a text input area of the screen 104, and in other respects the present embodiment operates as described above.

It is advantageous to provide the speech recognition at the terminal 100, where it is possible to train on the voice of the individual user, rather than centrally. Also, since text rather than audio is uplinked, the required uplink bandwidth is kept low.

On the other hand, providing the generation centrally avoids the need to store multiple rules databases locally at terminals.

In this embodiment, the terminal 100 may also comprise a microphone, and a text to speech program arranged to synthesise speech from the text received from the computer 200 to provide audio output via a loudspeaker 105.

The applet 116 may also be arranged to generate a visual display to represent the output data; for example, a representation of a human face, or entire human head, animated in synchronism with the output speech as described in our earlier application EP-A-225729, or a sign language display comprising an animated representation of a pair of hands generating sign language (for example British or American sign language) from a text to sign language converter program. This latter embodiment is particularly advantageous for those with hearing difficulties.

Fifth Embodiment

In this embodiment, the data input aspect (but not the data retrieval aspect) of the first or fourth embodiments is utilised to derive a specification for writing a computer program, from which such a program may be generated automatically.

A specification should consist of a set of statements about the functions which a program should perform. In particular a telecommunications control program for controlling operation of intelligent network functions will often take the form of an indication of the actions performed in response to the occurrence of certain conditions. For example, where a called party line is busy, a “call waiting” alerting function alerting the user to another incoming call may be performed.

Referring to FIG. 21, the process of generating a computer program according to this embodiment proceeds through three broad stages: a specification input phase 1000, a validation phase 1100, and a code generation phase 1200.

In the specification input phase, a series of performance statements specifying the functions to be performed by the program are input, as described in the first embodiment. Any ambiguities in each statement are therefore detected and corrected on input. This stage will therefore not be described further.

In the second stage, the consistency checking performed is not exhaustive but consists of two checks. Firstly, a causality check is applied, by generating, from each statement which implies that event A causes an event B (for example, that a busy called line will cause a call divert), a graph indicating that A must occur before B, and then aligning the results thus generated to determine whether any impossible sequences (where, for example, A is expected to cause B but B is expected to cause A) have been specified.
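One possible realisation of the causality check, offered only as a sketch, is to merge the individual “A before B” graphs into a single directed graph and to search it for a cycle. The depth-first search below is a standard technique chosen for this illustration, and the function name `find_cycle` and the event labels are assumptions.

```python
def find_cycle(causes):
    """Return a list of events forming an impossible sequence, or None.

    `causes` is an iterable of (a, b) pairs, each meaning that event a
    causes, and so must occur before, event b.
    """
    graph = {}
    for a, b in causes:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:       # back to an event on the current path
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None

consistent = find_cycle([("called line busy", "call divert")])
impossible = find_cycle([("A", "B"), ("B", "A")])
```

The first specification yields no cycle; the second yields the sequence A, B, A, which can be reported to the user as an impossible sequence.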

Secondly, some consistency checking can be performed by inference rules, using the hierarchical dictionary. For example, if in one statement all events of a certain type are specified to lead to one result, and yet in another statement a specific event of a hierarchically lower (i.e. narrower) class is stated to lead to a different result, the inconsistency is detected by examining, for each statement concerning a narrow (i.e. hierarchically lower) semantic element, all statements made about that element or hierarchically higher elements, to flag inconsistencies.
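This second check might be sketched as follows, on the simplifying assumption that each statement reduces to an (event, result) pair and that the hierarchical dictionary is available as a mapping from each event class to its immediately broader class. The function name and the example event classes are hypothetical.

```python
def find_inconsistencies(rules, broader):
    """Flag narrower events whose stated result differs from a broader rule.

    `rules` is a list of (event, result) pairs; `broader` maps an event
    class to its immediately broader class in the hierarchy.
    """
    results = {}
    for event, result in rules:
        results.setdefault(event, set()).add(result)

    flagged = []
    for event in results:
        ancestor = broader.get(event)
        while ancestor is not None:       # examine every hierarchically higher element
            for general in sorted(results.get(ancestor, ())):
                for specific in sorted(results[event]):
                    if specific != general:
                        flagged.append((event, specific, ancestor, general))
            ancestor = broader.get(ancestor)
    return flagged

broader = {"called line busy": "call failure", "no answer": "call failure"}
rules = [
    ("call failure", "play announcement"),
    ("called line busy", "divert call"),
    ("no answer", "play announcement"),
]
conflicts = find_inconsistencies(rules, broader)
```

In this example the statement that a busy called line leads to a call divert is flagged, because all call failures were stated to lead to an announcement. Whether such a conflict is a genuine error or an intended exception would be a matter for the user to resolve.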

These consistency and causality checks are performed on each occasion when a new statement is entered. Thus, the likelihood of entering inconsistent specifications is reduced.

Having entered the specification (step 1202), code generation may be performed automatically using a suitable compiler for compiling from high level to some description languages, or may be performed manually.

Sixth Embodiment

It will be clear from the foregoing that the invention provides a data retrieval system in which multiple authors can conveniently work upon the same body of information, by adding new links to concepts stored within the server computer 200, for retrieval by (possibly multiple) different users.

In this embodiment, the features of a graphical display of the input text representing stored linkages between the information represented therein, and the possibility of parsing the information, are maintained. However, data input is accelerated by permitting the graphical input of text prior to full semantic analysis, for subsequent semantic analysis and disambiguation.

In this embodiment, as in the preceding embodiments, the entry of text takes place via a computer terminal running a programme (which may be an “applet” running through a hypertext browser programme).

FIG. 22 illustrates the screen display produced. It will be seen toconsist of a number of text-containing shapes. As in the previousembodiment, the shapes for intervisually distinctive classes; thus,verbs or states of affairs are represented as rectangles of a particularcolour such as green; entities are represented by a rounded rectangle,preferably also in a different colour (such as blue), and modifiers arerepresented by a lozenge shape. In this embodiment, prepositionsconnecting to entities, such as the word “of”, in English are alsorepresented graphically, as triangles in this case.

In addition to these shapes, a default text shape (shown here as a gray rectangle) is provided, into which free-form text may be entered. In FIG. 22, the box containing the words “a novel approach to managing, publishing and sharing knowledge” constitutes such a default shape.

Linking the shapes are arrows, representing graphically the connections between the components (entities, states of affairs and modifiers) in the original text.

Referring now to FIG. 23, to input new stored text, in a step 2002, a user inputs text into the terminal 100 via the keyboard 102. In this embodiment, unlike the previous embodiment, the user first creates a new box (by default, a default box) by some suitable action such as “right-clicking” the mouse, and then text typed by the user on the keyboard is entered into the box.

Existing boxes may be selected by “left-clicking” the mouse, and the text therein amended.

Once a box ceases to be selected (for example because another box has been selected), the terminal 100 transmits a signal to the server 200 indicating either the creation of a new box, or the change of text in an existing box. The server 200 creates or amends a corresponding stored record. Each box therefore corresponds to a record on the server 200, which initially contains the text stored in the box.

In a step 2004, a user may create links between the boxes, using the mouse (for example using a “dragging” action of the mouse). Each such link is directional, having a source box and a destination box, and the terminal 100 is arranged to graphically represent the link by a line with an arrow indicating the direction of the link, running between the source and the destination boxes.

As in the previous embodiment, information on each link thus created is transmitted to the server 200 and a corresponding alteration to the data stored therein is made, to link the records representing the source and destination boxes.
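One possible arrangement of the stored records and directional links described above is sketched below, purely as an assumption for illustration. The class names `BoxRecord` and `Server`, the method names, and the use of integer record identifiers are all hypothetical; the description does not specify the record layout.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class BoxRecord:
    record_id: int
    text: str                                      # text initially stored in the box
    outgoing: list = field(default_factory=list)   # ids of destination records
    incoming: list = field(default_factory=list)   # ids of source records


class Server:
    """Minimal stand-in for the record store held at the server."""

    def __init__(self):
        self.records = {}
        self._ids = count(1)

    def create_box(self, text):
        """Create a record for a newly created box and return its id."""
        record = BoxRecord(next(self._ids), text)
        self.records[record.record_id] = record
        return record.record_id

    def amend_box(self, record_id, text):
        """Replace the text of an existing record after the box is edited."""
        self.records[record_id].text = text

    def link(self, source_id, dest_id):
        """Record a directional link from the source box to the destination box."""
        self.records[source_id].outgoing.append(dest_id)
        self.records[dest_id].incoming.append(source_id)
```

Storing each link at both ends is a design choice of this sketch; it allows the links to be followed in either direction without a search.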

As is shown in FIG. 22, this graphical display provides a method for users to enter text in a way which also conveys some meaning. Users can structure and group their ideas by arranging and grouping the boxes, and connecting them with connecting lines.

For example, the ideas in the display concerning the entity which is a program referred to as “FreeFlow” use a single box to represent that program in all of the different statements made about it. That box, containing the text “FreeFlow”, is connected to four boxes representing states of affairs, containing respectively the text “is”, “overcomes”, “uses”, and “consists of”.

Each of these is then connected to further boxes: a box containing text describing what FreeFlow is; a box containing text describing what FreeFlow overcomes; a box containing text describing what FreeFlow uses; and four boxes containing text describing what FreeFlow consists of.

In some respects, therefore, the display of FIG. 22 resembles an entity/relation graph, but it is unconstrained by formal logic at this stage, and a user can simply enter text as desired and connect the text.

After this stage, then, the text entered (and stored at the server 200) is distinguished over the corresponding free text which might have been entered as a normal document by the following features:

1. All statements about a given entity (e.g. “FreeFlow”) refer back to the same record representing that entity.

2. Some structure is recognisable in the stored data, by virtue of the links.

In a step 2006, on execution of a command by the user (such as selection of a button displayed on the display using the mouse) the data contained in the boxes is reconciled, to the extent possible, with data already contained at the server 200. Accordingly, at the server 200, the text within each box is examined and compared with text in the lexical database 234 and any other databases of existing concepts.

In some cases, there will be an exact correspondence between the text contained in the boxes and records held in the lexical database 234. For example, “is”, “uses”, “overcomes”, “of”, and “consists of”, will all be recognised as corresponding to lexical database entries.

In such cases, the text held in the record of the server 200 is augmented by a pointer to the lexical database entry, and the shape displayed is selected so as to match the recognised characteristics of the text (for example, to indicate whether it is an entity or a state of affairs).

The connecting line pointer records are used to interpret the attributes of the recognised item of text; for example, if the recognised item of text is a verb, the connecting lines may be used to determine the subject and object of the verb.

In some cases, a recognised word will be ambiguous. For example, in the screen shown in FIG. 23, the word “uses” can either be a form of the verb “to use”, or a plural noun. Where there is an ambiguity, the server 200 causes the display on the screen of the terminal 100 of the multiple entries from the lexical database 234 to which the word could relate, as shown in FIG. 24b, and the user selects one of the meanings. The selected meaning is signalled back to the server computer 200, and used instead of the text, in the record corresponding to the box displayed on the screen of FIG. 24a.

Thus, at the conclusion of step 2006, those boxes containing text which is recognised as corresponding to entities already stored at the server 200 are replaced by links to those entities at the server 200.
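The reconciliation of step 2006 might, under stated assumptions, be sketched as follows. The lexicon contents, the entry identifiers such as `lex:use-verb`, the dictionary-based record layout and the `choose` callback (standing in for the user's selection between displayed meanings) are all hypothetical; only the shape classes are taken from the description of FIG. 22.

```python
# Hypothetical lexical database: surface text -> candidate (entry id, kind) pairs.
LEXICON = {
    "is": [("lex:be", "state")],
    "overcomes": [("lex:overcome", "state")],
    "uses": [("lex:use-verb", "state"), ("lex:use-noun", "entity")],
    "of": [("lex:of", "preposition")],
}

# Shape displayed for each kind, following the classes described for FIG. 22.
SHAPES = {
    "state": "rectangle",
    "entity": "rounded rectangle",
    "modifier": "lozenge",
    "preposition": "triangle",
}


def reconcile(boxes, choose):
    """Augment recognised box records with a pointer to a lexical entry.

    boxes  : dict mapping box id -> record dict holding at least "text".
    choose : callback(text, candidates) returning the candidate selected
             when the text is ambiguous (here, a stand-in for the user).
    Unrecognised text is left untouched as free text.
    """
    for record in boxes.values():
        candidates = LEXICON.get(record["text"].strip().lower(), [])
        if not candidates:
            continue
        if len(candidates) == 1:
            entry_id, kind = candidates[0]
        else:
            entry_id, kind = choose(record["text"], candidates)
        record["lexical_entry"] = entry_id
        record["shape"] = SHAPES[kind]
    return boxes
```

For instance, passing a callback that returns the first candidate resolves “uses” to the verb sense, while a box containing “FreeFlow” remains unrecognised until a new entry is defined.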

In step 2008, or on selection by the user of an appropriate control button displayed on the screen, the user is permitted to add new items to the lexical database 234 (or, conveniently, to a supplementary database of additional concepts). For example, in the screen shown in FIG. 22, the concepts “FreeFlow”, “FreeFlow language”, “FreeFlow client”, “FreeFlow generator”, “FreeFlow server”, and “FreeFlow knowledge store” may be added.

In each case, the server computer 200 prompts the user via the terminal 100 to indicate whether the entry is an entity, a modifier or a state of affairs, and permits the user to follow, from the top, the hierarchy used by the lexical database 234 to link the new entity into the hierarchy defined thereby.

Additionally, pre-defined commonly used categories are preferably provided, such as “entity”, “definitions”, “document”, “person”, “product”, and “organisation”.

Having recorded the new entry in the lexical database, the records previously containing text are replaced or augmented by a reference to that record in the same manner as described above.

Thus, at the end of this step, all boxes which are either recognised by the server 200 or designated by the user as corresponding to a particular concept are linked through pointers to and from the lexical record of that concept, so that all statements held within the server 200 about a given concept can be subsequently accessed.

Where the content of a box contains multiple words (for example “FreeFlow knowledge store”), the box may accordingly represent a particular instance (e.g. the FreeFlow knowledge store as an entity in its own right) of a more general class (for example “knowledge store” in general, or “store” in general). The present embodiment allows such concepts to be separately identified from the generic classes within which they lie.

At this stage, then, text information has been input in a fashion which allows the authors to group their thoughts logically and represent them graphically, and permits individual ideas or concepts to be identified with those already in the server 200, or to be newly stored therein for future use if they do not correspond to anything previously stored therein. This is useful in its own right, since it ensures that multiple statements about, for example, “FreeFlow” can all be accessed subsequently by searching the server 200 for the record of that concept and then following the links to each record thereof.

Moreover, the text of the statements can be regenerated from the records held in the server, by following the links recorded between the different records of each statement, in the order specified by the direction of the links, to reconstitute the original text. However, unlike the first embodiment, the regenerated text is not fully language independent, since some records may contain strings of text in the original source language.
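The regeneration of text by following the links in their recorded direction might be sketched as below. This is a minimal sketch under assumptions: records are reduced to strings keyed by hypothetical integer identifiers, links to a mapping from a source identifier to its destination identifiers, and one statement is emitted for each path from the starting record to a record having no further outgoing link.

```python
def regenerate(texts, links, start):
    """Reconstitute statements by following directional links from `start`.

    texts : dict mapping record id -> text (or regenerated word) of the record.
    links : dict mapping source record id -> list of destination record ids.
    Returns one string per path ending at a record with no onward link.
    """
    statements = []

    def walk(node, words, seen):
        words = words + [texts[node]]
        # Skip records already on this path so that a cycle cannot loop forever.
        onward = [d for d in links.get(node, []) if d not in seen]
        if not onward:
            statements.append(" ".join(words))
        for dest in onward:
            walk(dest, words, seen | {dest})

    walk(start, [], {start})
    return statements
```

Records still holding free text are reproduced verbatim by such a walk, which is why the output remains tied to the source language for those records.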

Accordingly, in this embodiment, on selective actuation of a displayed button at the screen at the terminal 100 by the user, the analysis process described in the first embodiment is performed in step 2010, to parse the text held in the boxes for each statement, to attempt fully to analyse their meanings. Accordingly, for each box containing text which has not yet been fully replaced by a reference to a stored item, each word is analysed as described in the first embodiment, and the statements containing the word are parsed, and ambiguities are displayed for the user to resolve.

Conveniently, the original record for each box is retained, with the text it contains, for subsequent display if desired. However, by selecting the box (e.g. by “double clicking” using a mouse) the internal structure is displayed.

In the record for the box, a flag indicates that it is a composite containing several recognised concepts, and a pointer points to the newly created records, one for each of the recognised concepts. Thus, for example, the box shown containing the text “a novel approach to managing, publishing and sharing knowledge” is replaced with the structure shown in FIG. 26, in which each word is represented by a separate record.
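A composite record of the kind just described might be represented as in the following sketch. The `Record` class and `expand` function are assumptions for illustration, and the simple word split used here merely stands in for the analysis of the first embodiment, which is not reproduced.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str                                      # original text, retained for display
    is_composite: bool = False                     # flag: holds several recognised concepts
    children: list = field(default_factory=list)   # pointer to the newly created records


def expand(record):
    """Mark a record as composite and create one child record per word.

    The regular-expression word split is a placeholder for full parsing.
    """
    record.children = [Record(word) for word in re.findall(r"\w+", record.text)]
    record.is_composite = True
    return record
```

Because the parent keeps its own text, the box can still be displayed in its original form, with the children shown only when the internal structure is requested.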

As just described, any words which are not recognised as corresponding to those already in the lexical store or supplement thereto may be newly defined by the user.

Thus, in step 2012, new records for each recognised word are stored and amendments to existing records are made.

Referring to FIG. 25, data is retrieved from the server 200 in similar fashion to that described in relation to the first embodiment. Typically, the user will interrogate the server 200 via a browser program on a terminal 100. One or more search terms will be entered in a form which is uploaded to the server 200.

In a step 2102, the terms are looked up in the lexical store, and in a step 2104 any hierarchically lower entries or synonyms for the same concept are also looked up. These steps correspond to step 724 in the earlier embodiments.

In a step 726 all records pointing to the entries concerned are accessed, as in the first embodiment.

In a step 2108, a text search is performed through the records of those objects which have not been fully replaced by references to the lexical database. Any records thus located which contain the input text are added to the list of objects located in step 726.
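The retrieval steps above, namely looking up the term, adding hierarchically lower entries, collecting records pointing to those entries, and falling back to a text search over unresolved records, might be sketched as follows. The hierarchy contents, the dictionary-based record layout and the function names are assumptions made for this sketch; synonym handling is omitted.

```python
# Hypothetical lexical hierarchy: entry -> hierarchically lower entries.
CHILDREN = {
    "store": ["knowledge store"],
    "knowledge store": [],
}


def expand_term(term):
    """Return the term together with all hierarchically lower entries."""
    found, stack = [], [term]
    while stack:
        entry = stack.pop()
        if entry not in found:
            found.append(entry)
            stack.extend(CHILDREN.get(entry, []))
    return found


def search(term, records):
    """Locate records for a search term.

    records : list of dicts with "text" and, once reconciled, "lexical_entry".
    Records pointing to a matching lexical entry come first; records still
    holding free text are then matched by a plain substring search.
    """
    entries = set(expand_term(term))
    hits = [r for r in records if r.get("lexical_entry") in entries]
    hits += [
        r for r in records
        if "lexical_entry" not in r and term.lower() in r["text"].lower()
    ]
    return hits
```

The second pass is what allows partially analysed input to remain searchable: a record never needs to be fully reconciled before it can be found.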

In step 724, as in the previous embodiment, the output text is generated corresponding to the statements containing the detected search terms, using the link data. Any text stored in records is directly reproduced.

In step 736, as in the first embodiment, the text is transmitted to the terminal 100.

Even where text has been retained in some records, it is still possible, as in the first embodiment, to make use of the links and any neighbouring records which have been recognised as corresponding to entries in the lexical database, to perform some reasoning and inference operations on the text held in the records.

In addition to, or as an alternative to, the generation of text directly, a graphical display corresponding to that of FIG. 22 may be reproduced, illustrating the structure of the retrieved statements graphically.

In this case, selecting any record using the mouse at the terminal 100 causes the display of further statements which point to the same lexical entry as does that record, to provide information retrieval in a manner similar to hypertext.

It will be seen that this embodiment combines the advantages of the earlier embodiments with the provision of an intermediate graphical form of input, which is convenient to authors while still allowing some parsing and inference of the input information.

Other Modifications and Embodiments

The consistency checking features of the above described fifth embodiment may, of course, be used with any of the preceding embodiments and, in general, unless the contrary is indicated, features of the above described embodiments may freely be combined.

The foregoing embodiments are merely examples of the invention and are not intended to be limiting, it being understood that many other alternatives and variants are possible within the scope of the invention. Protection is sought for any and all novel subject matter disclosed herein and combinations of such subject matter.

What is claimed is:
 1. Apparatus for use in identifying semantic ambiguities, comprising (i) input means, including a text input means; (ii) storage means; (iii) processing means arranged to parse input text to a. identify semantically meaningful elements in the input text; b. derive relationships between the identified semantically meaningful elements; and c. store a group of the identified semantically meaningful elements, and data defining the derived relationships, in the storage means; (iv) graphical display means, which, in response to said parsing of the input text, is arranged to generate a visual representation of said input text in which the identified semantically meaningful elements are represented by pictorial elements and the derived relationships therebetween are represented by linking elements interconnecting the pictorial elements.
 2. Apparatus as in claim 1 wherein: said input means further comprises a graphical input means arranged to interact with said visual representation to allow editing of said semantically meaningful elements identified from the input text, and said graphical display means is arranged correspondingly to update said visual representation.
 3. Apparatus as in claim 1 wherein said storage means comprises a lexical store storing an entry for each possible said semantically meaningful element.
 4. Apparatus as in claim 3 wherein each said entry includes meaning data relating to the meaning of the corresponding element.
 5. Apparatus as in claim 4 wherein said processing means is arranged to analyse at least one said group in accordance with the meaning data stored for the corresponding entries in the lexical store.
 6. Apparatus as in claim 5 wherein said processing means is arranged to locate inconsistencies between different elements in said at least one group.
 7. Apparatus for generation of a computer program from a natural language source document specification of the function thereto, said apparatus comprising: input means as in claim 6, arranged to input part or all of said source document specification, and to test said input for semantic ambiguity, and a code generator means for generating code when no further ambiguity is detected, for analyzing said source document specification for inconsistency, and for generating code when no further inconsistency is detected.
 8. Apparatus as in claim 3 wherein said lexical store is arranged to store an indication of the level of familiarity of the corresponding semantically meaningful element to users.
 9. Apparatus as in claim 3, arranged to receive source documents in a plurality of languages, in which said possible semantically meaningful element is a word and wherein corresponding words share a common entry in said lexical store.
 10. Apparatus as in claim 1 wherein said storage means is arranged to store, for said semantically meaningful elements identified from the input text, timestamp data indicating the time of origin of said semantically meaningful elements identified from the input text.
 11. Apparatus as in claim 1 wherein said storage means is arranged to store, for semantically meaningful elements within said text, access level information.
 12. Apparatus as in claim 1 wherein said storage means is arranged to store data representing a source document of said text, in addition to said semantically meaningful elements identified from the input text relating to said text.
 13. Apparatus as in claim 1 wherein: said text includes, or refers to, document components in media other than that of said document, and said storage means is arranged to store said document components or references thereto, with said semantically meaningful elements identified from the input text.
 14. Apparatus as in claim 1 wherein said input means comprises a speech recogniser.
 15. Apparatus as in claim 1 wherein said input means comprises a keyboard.
 16. Apparatus for data retrieval of semantically meaningful elements identified from the input text which has been input by apparatus as in claim 1, said apparatus for data retrieval, comprising: additional storage means storing said semantically meaningful elements identified from the input text; query input means for inputting search parameters; processing means for retrieving one or more of said at least one group.
 17. Apparatus as in claim 16, further comprising generating means for generating an output document from said one or more group.
 18. Apparatus as in claim 17, further comprising output means for outputting said output document.
 19. Apparatus as in claim 17 wherein said generating means is arranged to generate said output document in at least one language different to the language in which the text was input.
 20. Apparatus as in claim 18 wherein said output means comprises a text display terminal.
 21. Apparatus as in claim 18 wherein said output means comprises a speech synthesiser.
 22. Apparatus as in claim 18 wherein said output means comprises an animated display generator.
 23. Apparatus as in claim 22 wherein said animated display generator generates a sign language display.
 24. Apparatus as in claim 22 wherein said animated display generator generates a representation of a speaking human face.
 25. Apparatus as in claim 16 for retrieving semantically meaningful elements identified from the input text stored by an apparatus for use in identifying semantic ambiguities, comprising (i) input means, including a text input means; (ii) storage means; (iii) processing means arranged to parse input text to a. identify semantically meaningful elements in the input text; b. derive relationships between the identified semantically meaningful elements; and c. storage a group of the identified semantically meaningful elements, and data defining the derived relationships, in the storage means; and (iv) graphical display means, which, in response to said parsing of the input text, is arranged to generate a visual representation of said input text in which the identified semantically meaningful elements are represented by pictorial elements and the derived relationships therebetween are represented by linking elements interconnecting the pictorial elements; wherein said processing means is arranged to receive search parameters defining plural search criteria, and to analyse said relationships in dependence upon said criteria, and to output groups which meet said criteria in dependence upon said analysis.
 26. Apparatus as in claim 16 for retrieving semantically meaningful elements identified from the input text stored by apparatus wherein: said storage means comprises a lexical store storing an entry for each possible said semantically meaningful element; each said entry includes meaning data relating to the meaning of said element, and said processing means is arranged to analyse said at least one group in accordance with the meaning data stored for the corresponding entries in the lexical store, and to select at least one group for output in dependence thereupon.
 27. Apparatus as in claim 16, for retrieving semantically meaningful elements identified from the input text stored by apparatus wherein: said storage means comprises a lexical store storing an entry for each possible said semantically meaningful element; said lexical store being arranged to store an indication of the level of familiarity of the corresponding semantically meaningful element to users; and said processing means being arranged to select at least one group for output in dependence upon said familiarity data.
 28. Apparatus as in claim 16, for retrieving semantically meaningful elements identified from the input text stored by apparatus wherein: said storage means is arranged to store, for semantically meaningful elements within said input text, access level information; and said processing means being arranged to select semantically meaningful elements for output in accordance with said access level information.
 29. Apparatus as in claim 28 further comprising generator means for generating an output document from said at least one group wherein: said processing means is arranged selectively to suppress output of selected said select semantically meaningful elements within said at least one group in accordance with said access level information, and said generator means is arranged to generate a document from the remainder of the content of said at least one group.
 30. Apparatus for generation of a computer program from a natural language source document specification of the function thereof, said apparatus comprising: input means as in claim 1, arranged to input part or all of said source document specification, and to test said input for semantic ambiguity, and a code generator means for generating code when no further ambiguity is detected.
 31. Apparatus as in claim 1, in which said storage means is arranged to store data representing the author of said text.
 32. Apparatus according to claim 1 wherein the processing means is operable to store one or more further groups of data representing alternative relationships between the identified semantically meaningful elements.
 33. Apparatus according to claim 1 wherein the input means further includes query input means for inputting search parameters, which query input means is in operative association with the processing means, and is arranged to pass the search parameters to the processing means for parsing thereof, whereupon the processing means identifies semantically meaningful elements, and relationships therebetween, from the search parameters, the apparatus further comprising retrieving means for retrieving data from the said storage means in accordance with the parsed search parameters.
 34. Apparatus for use in identifying semantic ambiguities, said apparatus comprising: (i) input means including a text input means; (ii) storage means; (iii) processing means arranged to parse input text to a. identify semantically meaningful elements in the input text; b. derive relationships between the identified semantically meaningful elements; c. store a group of the identified semantically meaningful elements, and data defining the derived relationships, in the storage means; d. store access level information associated with the semantically meaningful elements in the group; and graphical display means, which, in response to said parsing of the input text, is arranged to generate a visual representation of said input text in which the identified semantically meaningful elements are represented by pictorial elements and the derived relationships therebetween are represented by linking elements interconnecting the pictorial elements.
 35. Apparatus as in claim 34 wherein said storage comprises a lexical store storing an entry for each possible said semantically meaningful entry.
 36. Apparatus as in claim 35 wherein each said entry includes meaning data relating to the meaning of said entity.
 37. Apparatus as in claim 36 wherein said processor is arranged to analyze said semantically meaningful elements in accordance with the meaning data stored for the corresponding entries in the lexical store.
 38. Apparatus as in claim 37 wherein said processor is arranged to locate inconsistencies between different said semantically meaningful elements.
 39. Apparatus as in claim 35 wherein said lexical store is arranged to store an indication of the level of familiarity of the corresponding semantically meaningful element to users.
 40. Apparatus as in claim 35, arranged to receive source documents in a plurality of languages, in which corresponding words share a common entry in said lexical store.
 41. Apparatus as in claim 34 wherein said storage is arranged to store, for said content data, timestamp data indicating the time of origin of said semantically meaningful elements identified from the input text.